STATS 32 Session 3: Data Visualization

Kenneth Tay

Oct 9, 2018

Final project

Goal: Demonstrate that you know how to do data analysis in R

Minimum requirements:

Project proposal

Recap of week 1

Vectors

vec <- c("a", "b", "c")
vec
## [1] "a" "b" "c"
vec[c(2,4)]
## [1] "b" NA
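The `NA` above comes from asking for position 4 of a length-3 vector. A small sketch of the indexing rules at play (the variable names `picked` and `dropped` are only for illustration):

```r
vec <- c("a", "b", "c")

# Positions past the end of the vector come back as NA rather than an error
picked <- vec[c(2, 4)]
is.na(picked)

# A negative index drops that position instead of selecting it
dropped <- vec[-1]
dropped
```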

Lists

classes <- list(quarter = "Fall 2018/19",
             ID = c("STATS 32", "STATS 101", "STATS 200"),
             credits = 12)
classes$ID
## [1] "STATS 32"  "STATS 101" "STATS 200"
classes[["credits"]]
## [1] 12
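One point worth sketching here is the difference between single and double brackets on a list. The names `sub_list`, `value` and `second_id` are illustrative only:

```r
classes <- list(quarter = "Fall 2018/19",
                ID = c("STATS 32", "STATS 101", "STATS 200"),
                credits = 12)

sub_list <- classes["credits"]    # single bracket: still a list, of length 1
value    <- classes[["credits"]]  # double bracket: the element itself

class(sub_list)
class(value)

# $ and [ ] can be chained: second entry of the ID element
second_id <- classes$ID[2]
```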

Data frames

A special type of list:

data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
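A short sketch of what "a special type of list" means in practice: each column is a list element, and all elements have the same length. The helper names `col_lengths` and `first_mpg` are illustrative:

```r
data(mtcars)

is.list(mtcars)    # a data frame is a list under the hood
length(mtcars)     # number of list elements = number of columns

# Every column has one entry per row
col_lengths <- sapply(mtcars, length)

# List-style access works on columns
first_mpg <- mtcars[["mpg"]][1:3]
```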

Getting a feel for your data

Weird thing that happened last time…

I want all the rows such that the value of the cyl column is equal to 2:

vehicles[vehicles$cyl == 2, ]

Small example

df
##    A    B
## 1  1    a
## 2  2    b
## 3  3    c
## 4 NA    d
## 5 NA <NA>
df$A == 2
## [1] FALSE  TRUE FALSE    NA    NA
df[df$A == 2, ]
##       A    B
## 2     2    b
## NA   NA <NA>
## NA.1 NA <NA>

Small example: Fix

Fix 1: test that the value is not NA and is equal to 2

df[!is.na(df$A) & df$A == 2, ]
##   A B
## 2 2 b

Fix 2: use the which function

which(df$A == 2)
## [1] 2
df[which(df$A == 2), ]
##   A B
## 2 2 b
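The extra rows appear because `df$A == 2` evaluates to `NA` wherever `A` is missing, and indexing rows with `NA` returns a row of `NA`s. The sketch below rebuilds the small example so both fixes can be tried; the construction of `df` is an assumption based on the printed output, and `subset()` is shown as a third option not covered on the slide:

```r
# Rebuild the small example (column types assumed from the printout)
df <- data.frame(A = c(1, 2, 3, NA, NA),
                 B = c("a", "b", "c", "d", NA),
                 stringsAsFactors = FALSE)

naive <- df[df$A == 2, ]                 # NA in the test yields NA rows
fix1  <- df[!is.na(df$A) & df$A == 2, ]  # explicitly exclude NA
fix2  <- df[which(df$A == 2), ]          # which() keeps only the TRUE positions
fix3  <- subset(df, A == 2)              # subset() treats NA as FALSE
```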

Function syntax

E.g. Take the mean of c(1,3,NA).

mean(c(1,3,NA))
## [1] NA
mean(c(1,3,NA), na.rm = TRUE)
## [1] 2
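As a brief sketch of the syntax: `na.rm` is an optional named argument with a default of `FALSE`, and named arguments can be given in any order. The names `m1` and `m2` are illustrative:

```r
x <- c(1, 3, NA)

mean(x)                           # default na.rm = FALSE, so the result is NA
m1 <- mean(x, na.rm = TRUE)       # positional first argument, then a named one
m2 <- mean(na.rm = TRUE, x = x)   # all arguments named, order does not matter
```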

Agenda for today

Words vs. pictures

“The simple graph has brought more information to the data analyst’s mind than any other device.” - John Tukey

##     mpg weight cylinders
## 1  21.0  2.620         6
## 2  21.0  2.875         6
## 3  22.8  2.320         4
## 4  21.4  3.215         6
## 5  18.7  3.440         8
## 6  18.1  3.460         6
## 7  14.3  3.570         8
## 8  24.4  3.190         4
## 9  22.8  3.150         4
## 10 19.2  3.440         6
## 11 17.8  3.440         6
## 12 16.4  4.070         8
## 13 17.3  3.730         8
## 14 15.2  3.780         8
## 15 10.4  5.250         8
## 16 10.4  5.424         8
## 17 14.7  5.345         8
## 18 32.4  2.200         4
## 19 30.4  1.615         4
## 20 33.9  1.835         4
## 21 21.5  2.465         4
## 22 15.5  3.520         8
## 23 15.2  3.435         8
## 24 13.3  3.840         8
## 25 19.2  3.845         8
## 26 27.3  1.935         4
## 27 26.0  2.140         4
## 28 30.4  1.513         4
## 29 15.8  3.170         8
## 30 19.7  2.770         6
## 31 15.0  3.570         8
## 32 21.4  2.780         4

Two classes of variables in statistics

Barplots: counts for a categorical variable

What is the distribution of cylinders in my dataset?
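A minimal sketch of such a barplot, assuming ggplot2 is installed. The table above is rebuilt from `mtcars`; the name `df` and storing `cylinders` as a factor are assumptions:

```r
library(ggplot2)

# Rebuild the three-column table from mtcars (assumed construction)
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

cyl_counts <- table(df$cylinders)   # the counts the bars will show

p_bar <- ggplot(df, aes(x = cylinders)) +
    geom_bar()
```

Typing `p_bar` at the console draws the plot.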

Histograms: counts for a continuous variable

What is the distribution of miles per gallon in my dataset?
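A possible histogram for this question, again assuming ggplot2 and a `df` rebuilt from `mtcars`. The bin width of 2.5 is an arbitrary choice for illustration:

```r
library(ggplot2)

df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Each bar counts the cars whose mpg falls into a 2.5-wide bin
p_hist <- ggplot(df, aes(x = mpg)) +
    geom_histogram(binwidth = 2.5)
```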

Scatterplots: continuous variable vs. continuous variable

What is the relationship between mpg and weight?
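A sketch of the corresponding scatterplot, assuming ggplot2 and a `df` rebuilt from `mtcars`. The correlation is computed only as a numeric companion to the picture:

```r
library(ggplot2)

df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# One point per car: weight on the x-axis, mpg on the y-axis
p_scatter <- ggplot(df, aes(x = weight, y = mpg)) +
    geom_point()

# Negative: heavier cars tend to have lower mpg
r <- cor(df$weight, df$mpg)
```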

Lineplots: continuous variable vs. time variable

What is the relationship between mpg and time?

Not so good…

Easier to see the trend
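The data behind these two plots is not shown, so the sketch below substitutes the `economics` dataset that ships with ggplot2 (an assumption, not the slide's data) to contrast points with a line for a variable measured over time:

```r
library(ggplot2)

# Points only: the time ordering is hard to follow
p_points <- ggplot(economics, aes(x = date, y = unemploy)) +
    geom_point()

# Connecting consecutive observations makes the trend easier to see
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
    geom_line()
```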

Boxplots & violin plots: continuous variable vs. categorical variable

For each value of cylinder, what is the distribution of mpg like?
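A sketch of both plot types, assuming ggplot2 and a `df` rebuilt from `mtcars` with `cylinders` stored as a factor so that each value gets its own box or violin:

```r
library(ggplot2)

df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Boxplot: median, quartiles and outliers of mpg per cylinder count
p_box <- ggplot(df, aes(x = cylinders, y = mpg)) +
    geom_boxplot()

# Violin plot: a smoothed density of mpg per cylinder count
p_violin <- ggplot(df, aes(x = cylinders, y = mpg)) +
    geom_violin()
```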

We can combine multiple plots in one graphic

Summary

Case study

I have father-son pairs. For each pair, I record their height and weight, as well as their ethnicities. I want to study the relationship between characteristics of the father and that of the son. What plots could help me?

Data visualization in R: 2 broad approaches

base R

ggplot2

How can we describe a graphic?

Hadley Wickham

3 essential elements of graphics: data, geometries, aesthetics

Data: Dataset we are using for the plot

##     mpg weight cylinders
## 1  21.0  2.620         6
## 2  21.0  2.875         6
## 3  22.8  2.320         4
## 4  21.4  3.215         6
## 5  18.7  3.440         8
## 6  18.1  3.460         6
## 7  14.3  3.570         8
## 8  24.4  3.190         4
## 9  22.8  3.150         4
## 10 19.2  3.440         6

Geometries: Visual elements used for our data

Geom: point

Aesthetics: Mappings from data columns to visual properties of the geom (e.g. position, color, size)

3 different aesthetics:

Examples of other aesthetics
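A sketch of a few such aesthetics, assuming ggplot2 and a `df` rebuilt from `mtcars`. It also contrasts mapping an aesthetic to a column (inside `aes()`) with setting it to a fixed value (outside `aes()`):

```r
library(ggplot2)

df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Color mapped to a column: one color per cylinder value, with a legend
p_color <- ggplot(df, aes(x = weight, y = mpg, color = cylinders)) +
    geom_point()

# Shape mapped to the same column instead
p_shape <- ggplot(df, aes(x = weight, y = mpg, shape = cylinders)) +
    geom_point()

# Color set to a constant: every point is blue, no legend
p_fixed <- ggplot(df, aes(x = weight, y = mpg)) +
    geom_point(color = "blue")
```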

Combining multiple plots into one graphic: Layers

We can have more than one layer in a graphic.

Each layer contains (essentially):

ggplot2 code: take 1

Making use of ggplot’s sensible defaults:

library(ggplot2)
# df is the mpg / weight / cylinders table shown earlier
ggplot() +
    geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_point(data = df, mapping = aes(x = cylinders, y = mpg))

ggplot2 code: take 2

Using jitter to avoid “overplotting”:

ggplot() +
    geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_point(data = df, mapping = aes(x = cylinders, y = mpg), 
               position = "jitter")

ggplot2 code: take 3

When layers share attributes, we only have to type them once:

ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_boxplot() +
    geom_point(position = "jitter")
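The three takes assume a data frame `df` that is never constructed on the slides. A self-contained sketch of take 3, assuming `df` is built from `mtcars` with `cylinders` as a factor; the jitter width, the seed, and hiding the boxplot's own outlier points are illustrative choices, not part of the original code:

```r
library(ggplot2)

df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

p_layers <- ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    # outlier.shape = NA hides the boxplot's outlier dots, since the
    # point layer already draws every car
    geom_boxplot(outlier.shape = NA) +
    # jitter sideways only, with a fixed seed so the plot is reproducible
    geom_point(position = position_jitter(width = 0.2, height = 0, seed = 32))
```

Typing `p_layers` at the console draws the two layers on shared axes.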

Today’s dataset: Diamonds

What makes an expensive diamond?
(Source: USA TODAY)

Optional material

Full specification of a graphic

One graphic contains:

Other grammatical elements: statistics

Behind the scenes, R may need to do some transformation on the dataset to make the graphic.
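A barplot is a simple case: `geom_bar()` counts the rows in each category before drawing. The sketch below, assuming ggplot2 and a `df` rebuilt from `mtcars`, does the same counting by hand and plots the result with `geom_col()`, which draws bar heights as given:

```r
library(ggplot2)

df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# geom_bar() applies the "count" statistic behind the scenes
p_auto <- ggplot(df, aes(x = cylinders)) +
    geom_bar()

# The same transformation done explicitly; table() names the count column Freq
counts <- as.data.frame(table(cylinders = df$cylinders))
p_manual <- ggplot(counts, aes(x = cylinders, y = Freq)) +
    geom_col()
```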

Other grammatical elements: position

Sometimes we need to tweak the position of the geometric elements because they obscure each other.

Only 9 data points??

Much better
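The data behind these two plots is not shown, so here is a sketch with a made-up dataset (`toy`, hypothetical) in which 90 observations fall on only 9 distinct positions. Without jitter the scatterplot appears to contain 9 points:

```r
library(ggplot2)

# 9 grid positions, each repeated 10 times (hypothetical data)
toy <- expand.grid(x = 1:3, y = 1:3)[rep(1:9, each = 10), ]

n_rows     <- nrow(toy)           # number of observations
n_distinct <- nrow(unique(toy))   # number of distinct positions

# Overplotted: identical points sit exactly on top of each other
p_over <- ggplot(toy, aes(x = x, y = y)) +
    geom_point()

# Jittered: a little random noise reveals all observations
p_jitter <- ggplot(toy, aes(x = x, y = y)) +
    geom_point(position = position_jitter(width = 0.2, height = 0.2, seed = 1))
```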

Other grammatical elements: facets
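Facets split one plot into panels, one per value of a variable. A sketch assuming ggplot2 and a `df` rebuilt from `mtcars`:

```r
library(ggplot2)

df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# One scatterplot panel per cylinder value, sharing the same axes
p_facet <- ggplot(df, aes(x = weight, y = mpg)) +
    geom_point() +
    facet_wrap(~ cylinders)
```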

Other grammatical elements: scales

Examples of scales (Source: A Layered Grammar of Graphics)

Scales example: colors

Default colors

Manually chosen colors

Scales example: x- & y-axes

Default axis limits

Manually chosen axis limits
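A sketch of both kinds of scale adjustment, assuming ggplot2 and a `df` rebuilt from `mtcars`. The specific colors and limits are arbitrary choices for illustration:

```r
library(ggplot2)

df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Default color scale and default axis limits
p_default <- ggplot(df, aes(x = weight, y = mpg, color = cylinders)) +
    geom_point()

# Manually chosen colors, one per factor level
p_colors <- p_default +
    scale_color_manual(values = c("4" = "darkorange",
                                  "6" = "purple",
                                  "8" = "steelblue"))

# Manually chosen axis limits (wide enough to keep every point)
p_limits <- p_default +
    scale_x_continuous(limits = c(0, 6)) +
    scale_y_continuous(limits = c(0, 40))
```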

Other grammatical elements: themes

Refers to all non-data ink

ggplot2’s default theme

Minimal theme

More pre-set themes

Classic theme

Dark theme
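The pre-set themes named above are added to a plot like any other component. A sketch assuming ggplot2 and a `df` rebuilt from `mtcars`:

```r
library(ggplot2)

df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

p <- ggplot(df, aes(x = weight, y = mpg)) +
    geom_point()

p_gray    <- p + theme_gray()      # ggplot2's default theme
p_minimal <- p + theme_minimal()   # no background shading
p_classic <- p + theme_classic()   # axis lines, no gridlines
p_dark    <- p + theme_dark()      # dark panel background
```

The data, geoms and aesthetics are identical in all four plots; only the non-data ink changes.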

Shapes in R

Colors in R

Color scales in R